Financial Contributions to Presidential Campaigns (Ohio State)

Dataset: Financial Contributions to Presidential Campaigns (Ohio State)
Time: 2016
The reason to choose this dataset:
Ohio is known as a swing state, so its result is often treated as an indicator of the national election outcome.

General R library and data loading & enrichment

Univariate Plots Section

## the total number of row in oh_data: 164475
##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"      
## [19] "party"             "Month_Yr"          "Day_Month"        
## [22] "weekday"           "surname"           "gender"
##       cmte_id           cand_id                           cand_nm     
##  C00575795:71194   P00003392:71194   Clinton, Hillary Rodham  :71194  
##  C00577130:34686   P60007168:34686   Sanders, Bernard         :34686  
##  C00580100:24166   P80001571:24166   Trump, Donald J.         :24166  
##  C00574624:16406   P60006111:16406   Cruz, Rafael Edward 'Ted':16406  
##  C00573519: 7937   P60005915: 7937   Carson, Benjamin S.      : 7937  
##  C00581876: 4824   P60003670: 4824   Kasich, John R.          : 4824  
##  (Other)  : 5262   (Other)  : 5262   (Other)                  : 5262  
##                   contbr_nm          contbr_city     contbr_st  
##  STOWE, JANICE         :   277   COLUMBUS  : 17328   OH:164475  
##  MISSLER, ANDREW J. MR.:   203   CINCINNATI: 15630              
##  BRIONES, BERTA        :   179   CLEVELAND :  5778              
##  MOESER, DONALD        :   176   DAYTON    :  4634              
##  CUMMINGS, JOHN        :   142   TOLEDO    :  3287              
##  SCHEEL, PATRICK       :   133   AKRON     :  3206              
##  (Other)               :163365   (Other)   :114612              
##    contbr_zip                     contbr_employer 
##  Min.   :       10   RETIRED              :27097  
##  1st Qu.:431109498   N/A                  :22434  
##  Median :440942900   SELF-EMPLOYED        : 8353  
##  Mean   :368573923   NONE                 : 7638  
##  3rd Qu.:450131451   INFORMATION REQUESTED: 7611  
##  Max.   :458969665   (Other)              :91213  
##  NA's   :3           NA's                 :  129  
##              contbr_occupation contb_receipt_amt contb_receipt_dt    
##  RETIRED              :43434   Min.   :-10800    Min.   :2014-07-17  
##  NOT EMPLOYED         :10378   1st Qu.:    16    1st Qu.:2016-02-29  
##  INFORMATION REQUESTED: 7549   Median :    28    Median :2016-05-31  
##  ATTORNEY             : 3320   Mean   :   120    Mean   :2016-05-16  
##  HOMEMAKER            : 3234   3rd Qu.:    80    3rd Qu.:2016-08-25  
##  (Other)              :96538   Max.   : 29100    Max.   :2016-11-28  
##  NA's                 :   22                                         
##                      receipt_desc    memo_cd   
##                            :162495    :127925  
##  Refund                    :   887   X: 36550  
##  REDESIGNATION FROM PRIMARY:   211             
##  REDESIGNATION TO GENERAL  :   210             
##  REATTRIBUTION TO SPOUSE   :   114             
##  REATTRIBUTION FROM SPOUSE :   112             
##  (Other)                   :   446             
##                                memo_text       form_tp      
##                                     :114599   SA17A:128232  
##  * EARMARKED CONTRIBUTION: SEE BELOW: 33677   SA18 : 35356  
##  * HILLARY VICTORY FUND             : 14385   SB28A:   887  
##  EARMARKED FROM MAKE DC LISTEN      :   282                 
##  *BEST EFFORTS UPDATE               :   246                 
##  REDESIGNATION FROM PRIMARY         :   211                 
##  (Other)                            :  1075                 
##     file_num                       tran_id       election_tp   
##  Min.   :1003942   A80E77D0E713E417AA88:     3        :   522  
##  1st Qu.:1077664   C11887628           :     3   G2016: 56271  
##  Median :1096260   C10225661           :     2   P2016:107682  
##  Mean   :1095976   C10228611           :     2                 
##  3rd Qu.:1119042   C10230213           :     2                 
##  Max.   :1134173   C10234145           :     2                 
##                    (Other)             :164461                 
##     party              Month_Yr       Day_Month          weekday     
##  Length:164475      2016-10:18582   Min.   : 1.00   Monday   :26927  
##  Class :character   2016-07:18208   1st Qu.: 8.00   Tuesday  :29339  
##  Mode  :character   2016-03:16599   Median :15.00   Wednesday:29176  
##                     2016-08:14777   Mean   :16.04   Thursday :23544  
##                     2016-04:14059   3rd Qu.:25.00   Friday   :24160  
##                     2016-02:13335   Max.   :31.00   Saturday :16619  
##                     (Other):68915                   Sunday   :14710  
##    surname            gender.gender   
##  Length:164475      Length:164475     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

I used the summary function to get a general idea of the whole dataset.

From the output and the variable definitions, I learned the type of each variable and decided on the next exploration steps.

The total number of rows in oh_data is 164,475. After enrichment, there are 24 variables.

The key questions I would like to answer with this dataset are:
1) Is there any correlation between the contributed amount and the voting result?
2) Are there any patterns in who donates, e.g. by occupation, gender, or city of residence?

Key variable: donation amount (contb_receipt_amt, numeric variable)
Other numeric variables for exploring distribution: N/A
Some important non-numeric variables: candidate name (cand_nm), gender (gender), occupation (contbr_occupation), city (contbr_city), party (party)

## [1] -10800  29100

Distribution of donation amount

The distribution is widely spread, and there are some negative amounts due to refunds. To get a better view of the donation amounts, I transformed the plot with a base-10 logarithm. On the log scale, the most common donation amount is around US$50.1 (10^1.7).
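
The transformation described above can be sketched in base R. This is a minimal illustration, not the code used for the report: the sample amounts and the bin width of 0.5 are assumptions chosen only to show the steps (drop non-positive amounts, take log10, find the most populated bin, convert back to dollars).

```r
# Hypothetical sample of donation amounts; the negative value stands for a refund.
amounts <- c(-50, 5, 10, 25, 27, 50, 50, 50, 100, 250, 2700)

# log10 is undefined for zero and negative values, so keep positive donations only.
positive <- amounts[amounts > 0]
log_amounts <- log10(positive)

# Bin the transformed values in steps of 0.5 (an assumed bin width)
# and find the most populated bin.
breaks <- seq(0, 4, by = 0.5)
bins <- cut(log_amounts, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)
modal_bin <- names(bin_counts)[which.max(bin_counts)]

# Convert a position on the log axis back to dollars: 10^1.7 is about 50.1.
back_to_dollars <- 10^1.7

print(bin_counts)
print(modal_bin)
```

Filtering out the refunds before the transform matters because `log10()` returns `NaN` for negative input and `-Inf` for zero.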

##  [1] COLUMBUS   COLUMBUS   CINCINNATI COLUMBUS   COLUMBUS   COLUMBUS  
##  [7] CINCINNATI AKRON      DAYTON     COLUMBUS   COLUMBUS   COLUMBUS  
## [13] TOLEDO     COLUMBUS   COLUMBUS  
## 1341 Levels:  BATAVIA 45320 ABERDEEN ADA ADAMS COUNTY ADDYSTON ... ZOAR
##  [1] COLUMBUS   COLUMBUS   CINCINNATI COLUMBUS   COLUMBUS   COLUMBUS  
##  [7] CINCINNATI AKRON      DAYTON     COLUMBUS   COLUMBUS   COLUMBUS  
## [13] TOLEDO     COLUMBUS   COLUMBUS  
## 10 Levels: COLUMBUS CINCINNATI CLEVELAND DAYTON TOLEDO ... LAKEWOOD

## 24  unique candidates
## 6555 unique occupations
## 1341 unique contributed cities

Overview of non-numeric variables: candidates, occupations, cities

After plotting bar charts and counting the unique values of these non-numeric variables, I found that there are too many unique occupations and cities. The graphs are hard to read when all occupations or cities are plotted, so I plotted the top 15 occupations and the top 10 cities that contributed the most funding.

There are only 24 unique candidates, so I used an abbreviation of each candidate’s name to plot a bar chart. The bar chart shows that C.HR (Clinton, Hillary Rodham) received the largest contributed amount in Ohio.

Party distribution

Although there are more donation records for the Democratic party, the total donated amount is higher for the Republican party. This may be because the average donation to the Republican party is higher.

Gender distribution

The gender split is almost equal (female : male is around 5 : 5).

## # A tibble: 1,341 × 2
##       contbr_city total_amount
##            <fctr>        <dbl>
## 1      CINCINNATI    2605688.7
## 2        COLUMBUS    2226563.1
## 3       CLEVELAND     866239.9
## 4   CHAGRIN FALLS     383091.9
## 5          DUBLIN     379636.9
## 6  SHAKER HEIGHTS     376150.9
## 7           AKRON     358729.6
## 8          DAYTON     353846.1
## 9          CANTON     277801.2
## 10    WESTERVILLE     254291.7
## # ... with 1,331 more rows
## # A tibble: 10 × 2
##       contbr_city total_amount
##            <fctr>        <dbl>
## 1      CINCINNATI    2605688.7
## 2        COLUMBUS    2226563.1
## 3       CLEVELAND     866239.9
## 4   CHAGRIN FALLS     383091.9
## 5          DUBLIN     379636.9
## 6  SHAKER HEIGHTS     376150.9
## 7           AKRON     358729.6
## 8          DAYTON     353846.1
## 9          CANTON     277801.2
## 10    WESTERVILLE     254291.7

City level: donation records vs. donation amounts

I listed the top 10 cities by number of donation records and by donation amount. Columbus, for example, has the most donation records of any city but does not have the highest total donation amount. This suggests that some cities have a larger share of relatively small donations.
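
The comparison of records against amounts can be sketched in base R as follows. The column names contbr_city and contb_receipt_amt come from the dataset; the sample rows and the helper column names (n_records, total_amount, mean_amount) are made up for illustration.

```r
# Hypothetical sample rows; only the column names match the dataset.
donations <- data.frame(
  contbr_city = c("COLUMBUS", "COLUMBUS", "COLUMBUS",
                  "CINCINNATI", "CINCINNATI", "CLEVELAND"),
  contb_receipt_amt = c(20, 30, 25, 500, 250, 100),
  stringsAsFactors = FALSE
)

# Number of donation records per city.
records <- aggregate(contb_receipt_amt ~ contbr_city, data = donations, FUN = length)
names(records)[2] <- "n_records"

# Total donated amount per city.
totals <- aggregate(contb_receipt_amt ~ contbr_city, data = donations, FUN = sum)
names(totals)[2] <- "total_amount"

# Combine both measures and add the mean donation size.
by_city <- merge(records, totals, by = "contbr_city")
by_city$mean_amount <- by_city$total_amount / by_city$n_records

# Rank the cities by each measure; head() keeps at most the top 10.
top_by_records <- head(by_city[order(-by_city$n_records), ], 10)
top_by_amount <- head(by_city[order(-by_city$total_amount), ], 10)

print(top_by_records)
print(top_by_amount)
```

A city that ranks high by n_records but lower by total_amount has a small mean_amount, which is the pattern described for Columbus.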

Univariate Analysis

What is the structure of your dataset?

There are 164,475 observations in the Ohio dataset with 18 original variables. For analysis purposes, I added 6 extra variables (party, Month_Yr, weekday, Day_Month, surname and gender).

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are “contb_receipt_amt” and the factors influencing the amounts. I’d like to find out which features have the most impact on raising larger contributed amounts, and I’d like to provide a few suggestions for future candidates running an election fund-raising campaign. I suspect city, occupation and day of week matter.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Since the 2016 American presidential election result has come out, it would be useful to compare the contributed amount data with the final voting result data. I downloaded the voting result data to analyze the correlation between the contributed amount and the number of voters in Ohio. (The analysis is covered in the next section.)

Did you create any new variables from existing variables in the dataset?

Yes, I created 6 variables for further analysis. The 6 variables are listed below.
1) party: I categorized the data into 3 categories (D, R, Other) based on candidate name
2) Month_Yr: shows the contributed amount trend by month
3) weekday: for analyzing whether there is a large difference between weekdays and weekends
4) Day_Month: the day of the month
5) surname: for predicting the gender with the gender library
6) gender: the gender of the contributor
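
A minimal base R sketch of how the party and date-derived variables could be built is given here. It is an illustration under assumptions, not the original enrichment code: the lookup table covers only two candidates, every candidate not in it falls into "Other" (which matches the sample output in this report, where Sanders rows are labelled Other), and is_weekend is an extra helper column that is not part of the dataset. The surname and gender variables are left out because they depend on an external library.

```r
# Hypothetical sample rows; only the column names match the dataset.
donations <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J.", "Sanders, Bernard"),
  contb_receipt_dt = as.Date(c("2016-10-03", "2016-07-16", "2016-03-01")),
  stringsAsFactors = FALSE
)

# Assumed, incomplete lookup from candidate name to party.
party_lookup <- c(
  "Clinton, Hillary Rodham" = "Democratic",
  "Trump, Donald J." = "Republican"
)
party <- unname(party_lookup[donations$cand_nm])
party[is.na(party)] <- "Other"  # candidates missing from the lookup
donations$party <- party

# Date-derived variables.
donations$Month_Yr <- format(donations$contb_receipt_dt, "%Y-%m")
donations$Day_Month <- as.integer(format(donations$contb_receipt_dt, "%d"))
donations$weekday <- weekdays(donations$contb_receipt_dt)  # names depend on locale

# "%u" is the ISO weekday number (1 = Monday, 7 = Sunday), which is locale independent.
donations$is_weekend <- format(donations$contb_receipt_dt, "%u") %in% c("6", "7")

print(donations)
```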

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I enriched the Ohio dataset with zipcode data to visualize the contributed amount on a map of Ohio. (The analysis is conducted in the multivariate plots section.)

After merging with the Ohio zipcode data from the zipcode library, I found 83 potentially wrong zipcodes, so I excluded them when plotting the contributed amount on the map. I excluded them because it is hard to identify the correct zipcode based on city names alone.
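
The report found the bad zipcodes by merging with a zipcode library. As a rough alternative check that needs no extra library, the sketch below normalizes the numeric contbr_zip values to 5 digits and flags anything that does not look like an Ohio zipcode. The function name normalize_zip is hypothetical, and the rule that Ohio zipcodes start with 43, 44 or 45 is an assumption of this sketch rather than something taken from the dataset.

```r
# normalize_zip is a hypothetical helper, not part of any library.
normalize_zip <- function(zip) {
  # Render numbers without scientific notation or padding; NA becomes "NA".
  z <- format(zip, scientific = FALSE, trim = TRUE)

  # 9-digit ZIP+4 values are cut down to their first 5 digits.
  z5 <- ifelse(nchar(z) == 9, substr(z, 1, 5), z)

  # Assumption: valid Ohio zipcodes have 5 digits and start with 43, 44 or 45.
  looks_valid <- nchar(z5) == 5 & substr(z5, 1, 2) %in% c("43", "44", "45")
  ifelse(looks_valid, z5, NA_character_)
}

# Two values from the sample output, plus an invalid value and a missing one.
zips <- c(451359416, 45249, 10, NA)
clean <- normalize_zip(zips)

print(clean)
print(sum(is.na(clean)))  # number of rows that would be excluded from the map
```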

Bivariate Plots Section

## Source: local data frame [2,475 x 4]
## Groups: contbr_city [?]
## 
##     contbr_city      party count total_amount
##           <chr>      <chr> <int>        <dbl>
## 1       batavia Republican     1       500.00
## 2         45320 Republican     1        80.00
## 3      aberdeen Democratic     5       900.00
## 4      aberdeen Republican     2        44.00
## 5           ada Democratic    97      4272.00
## 6           ada      Other    43      3682.88
## 7           ada Republican    18      1458.00
## 8  adams county Republican     1        80.00
## 9      addyston Democratic    11       392.55
## 10     addyston Republican     3       190.00
## # ... with 2,465 more rows
##    contbr_city Democratic   Other Republican
## 1      batavia       0.00    0.00        500
## 2        45320       0.00    0.00         80
## 3     aberdeen     900.00    0.00         44
## 4          ada    4272.00 3682.88       1458
## 5 adams county       0.00    0.00         80
## 6     addyston     392.55    0.00        190
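
The step from the long table (one row per city and party) to the wide table (one column per party) shown in the output above can be sketched in base R with xtabs. This is one possible way to do the reshape and not necessarily the one used for the report; the rows are copied from the output above.

```r
# Long format: one row per city and party, values taken from the output above.
by_city_party <- data.frame(
  contbr_city = c("aberdeen", "aberdeen", "ada", "ada", "ada",
                  "addyston", "addyston"),
  party = c("Democratic", "Republican", "Democratic", "Other", "Republican",
            "Democratic", "Republican"),
  total_amount = c(900, 44, 4272, 3682.88, 1458, 392.55, 190),
  stringsAsFactors = FALSE
)

# xtabs sums total_amount for every city/party pair and fills missing pairs with 0.
wide <- xtabs(total_amount ~ contbr_city + party, data = by_city_party)

# Convert the contingency table to a data frame with one column per party.
wide_df <- as.data.frame.matrix(wide)

print(wide_df)
```

Filling the missing pairs with 0 means each city has one value per party, which makes the per-party columns directly comparable.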

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I noticed that the positive relationship between the contributed amount and the number of voters is not strong. The relationship appears to be weak, which goes against my original assumption.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

When looking at the relationship between the contributed amount and the total number of voters, Republican party supporters show a stronger correlation than Democratic party supporters.

The correlation coefficients between contributed amount and voter numbers are:
1) Republican party: 0.401
2) Democratic party: 0.184

The Republican coefficient is higher than the correlation coefficient between total contributed amount and total voter numbers in Ohio (0.307), while the Democratic coefficient is lower.
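
For reference, a correlation coefficient of this kind can be computed in base R with cor(). The numbers below are invented, and the unit of observation (for example one row per area) is an assumption, since the report does not state it. The manual calculation shows the Pearson formula that cor() applies by default.

```r
# Hypothetical paired observations, e.g. one value per area (assumed unit).
contributed <- c(1200, 5300, 800, 15000, 2600, 400)
voters <- c(9000, 21000, 15000, 48000, 12000, 7000)

# Pearson correlation coefficient, the default method of cor().
r <- cor(contributed, voters, method = "pearson")

# The same coefficient from the definition:
# the sum of cross products of deviations divided by the root of the
# product of the summed squared deviations.
dx <- contributed - mean(contributed)
dy <- voters - mean(voters)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r)
print(r_manual)
```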

What was the strongest relationship you found?

The strongest relationship is between the total contributed amount and the contributed amount for the Republican party (the correlation coefficient is 0.934), because contributions from Republican party supporters account for ~60% of the total.

However, this is not a proper pair for checking a relationship because the 2 factors are not independent: the Republican amount is part of the total.
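
The effect of comparing a part with its own total can be illustrated with a small simulation. This is a sketch on randomly generated data, not on the Ohio dataset; the sample size, the seed and the exponential distributions are arbitrary assumptions. Even when the two party amounts are generated independently, each one correlates strongly with their sum.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Two independently generated amounts per area (simulated, not real data).
rep_amount <- rexp(500, rate = 1 / 600)
dem_amount <- rexp(500, rate = 1 / 400)
total_amount <- rep_amount + dem_amount

# The two parts are unrelated by construction, so this is close to 0.
cor_parts <- cor(rep_amount, dem_amount)

# The part still correlates strongly with the total because it is contained in it.
cor_part_total <- cor(rep_amount, total_amount)

print(cor_parts)
print(cor_part_total)
```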

Multivariate Plots Section

##                     cand_nm contbr_city contbr_zip contb_receipt_amt
## 1 Cruz, Rafael Edward 'Ted'    LEESBURG  451359416             25.00
## 2 Cruz, Rafael Edward 'Ted'     MINERVA  446579402             25.00
## 3   Clinton, Hillary Rodham    COLUMBUS  432141210             40.00
## 4          Sanders, Bernard    COLUMBUS  432022420             50.00
## 5   Clinton, Hillary Rodham     LEBANON  450365038             57.31
## 6          Sanders, Bernard  CINCINNATI      45249              2.50
##        party
## 1 Republican
## 2 Republican
## 3 Democratic
## 4      Other
## 5 Democratic
## 6      Other
## [1] 27392
## [1] 27309
## [1] 83

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I noticed that the major cities account for a larger share of the contributed amount. The map visualization clearly shows a few hot spots in Ohio.

Were there any interesting or surprising interactions between features?

After splitting the contributed amount by party, the data show that more funding went to the Republican party, which is in line with the voting result: the Republican party won Ohio in the end.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No. I tried to build a linear regression model between numeric and categorical data but it failed, and it seems to require a more complex statistical library.
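
One possible route, which the report did not take, is base R's lm(): it accepts a factor as a predictor and encodes it as dummy variables, so no additional library is needed for a simple model. The sketch below uses invented amounts and only the party variable; the log10 transform of the amount is an assumption carried over from the distribution plot.

```r
# Hypothetical sample rows; only the column names match the dataset.
donations <- data.frame(
  contb_receipt_amt = c(25, 50, 10, 100, 250, 40, 15, 500, 35),
  party = c("Democratic", "Democratic", "Democratic",
            "Republican", "Republican", "Republican",
            "Other", "Other", "Other"),
  stringsAsFactors = FALSE
)

# A factor tells lm() to treat party as categorical.
# Levels are sorted alphabetically, so "Democratic" is the reference level.
donations$party <- factor(donations$party)

# Model the log10 donation amount by party (positive amounts only).
fit <- lm(log10(contb_receipt_amt) ~ party, data = donations)

# The intercept is the mean log10 amount of the reference level;
# the other coefficients are differences from that level.
print(coef(fit))
```

With a single categorical predictor, the model reproduces the group means, so it mainly serves as a starting point before adding further variables such as city or occupation.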


Final Plots and Summary

Plot One

Description One

The correlation coefficients between contributed amount and voter numbers are:
1) Republican party: 0.401 (plot 1-2)
2) Democratic party: 0.184 (plot 1-3)

The Republican coefficient is higher than the correlation coefficient between total contributed amount and total voter numbers in Ohio (0.307, plot 1-1), while the Democratic coefficient is lower.

Plot Two

Description Two

The analysis of contributed amount by weekday shows that the contributed amount is lower on weekends. This might be because people tend to reserve their weekend time for family. I would suggest setting up campaign stops in places where people like to go with their families on weekends. It might help to increase the funds raised on weekends.

Plot Three

Description Three

The plot shows that the contributed money comes mainly from urban areas such as Columbus, Cleveland, Akron and Cincinnati. This helps candidates identify the cities to target in future campaigns to raise more funding.

I distinguish the funding for the Republican party and the Democratic party by color in Plot 3-1. It shows that there is more funding for the Republican party in Ohio, and the voting result also shows that the Republican party won Ohio.


Reflection

Before starting the analysis, I assumed that the contributed amount would be a strong indicator of the election result. After analyzing the relationship between the Ohio election result and the Ohio contribution data, I found that the correlation coefficient between these 2 factors is lower than I expected, and it does not indicate a strong correlation between contributed amount and voter numbers.

However, this analysis covers only one state. As a further step, I would suggest analyzing the data of all states in the U.S. to see if there is any strong relationship between these 2 factors.

During the analysis, I struggled with the more than 6,000 occupations, which I thought might hold some insights to be uncovered. It would be better if there were some default options for people to choose from when making a donation, such as “Retired”, “Public Servant”, “Military Soldiers” or “Teachers”. I could then cross-check with each party’s platform to see whether the platform has any impact on donation amounts by occupation.